Applying Math

Stories from the trenches of science

Improving a Clustering Algorithm of Rodriguez and Laio

You may download this notebook here.

A recent Science paper by Rodriguez and Laio proposed a simple method for clustering using two heuristics: a local density $\rho_j$ and the minimum distance to a point of higher density, $\delta_j$. In their experiments, they then build clusters by noting that cluster centers are outliers with simultaneously high $\rho_j$ and $\delta_j$. Their definitions are remarkably simple. The local density is given by

$$ \rho_j = \sum_k H( d_c - d_{j,k} ) $$

where $H$ is the Heaviside function (one if the argument is greater than zero, zero otherwise) and $d_{j,k}$ is the pairwise distance between points $j$ and $k$. The distance to a point with higher density is then simply

$$ \delta_j = \min_{k:\, \rho_k > \rho_j} d_{j,k}.$$

In this memo, we try their approach on a few datasets: first two sets similar to their examples, and then a selection from the NIST handwriting database. Although they claim their approach is not sensitive to the cutoff $d_c$, every case tried here shows sensitive dependence on $d_c$, producing duplicate cluster centers unless $d_c$ is chosen properly. Using a smoother density kernel, such as the $\chi^2_p$ density with $p$ degrees of freedom, we can produce more robust cluster centers with less duplication.

Implementation

Implementing these functions in Python is rather straightforward. Rather than explicitly forming the pairwise distance matrix, we provide a metric (for which sklearn's euclidean_distances is the prototype) and generate the distances on the fly. The following function computes the vector $\rho_j$.

In [2]:
import numpy as np

def local_density(metric, data, cutoff):
    """Local density rho_j from RL14."""
    n = data.shape[0]
    rho = np.zeros(n)
    for j in range(n):
        x = data[j]
        dist = metric(x, data).reshape(n)
        # Heaviside sum: count points strictly within the cutoff
        # (each point counts itself, adding a constant offset of one)
        rho[j] = np.sum(dist < cutoff)
    return rho
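As a quick sanity check of the definition, we can compute $\rho_j$ directly on five 1-D points; this is a self-contained sketch, and the points and the cutoff $d_c = 1.5$ are illustrative choices, not values from the paper.

```python
import numpy as np

# Five 1-D points: a pair of nearby triples and a distant pair
data = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
d = np.abs(data[:, None] - data[None, :])  # pairwise distance matrix
cutoff = 1.5

# Heaviside sum from the definition; each point counts itself once
rho = (d < cutoff).sum(axis=1)
print(rho)  # [2 3 2 2 2]
```

The middle point of the left group has the highest count, and the integer values already hint at how coarse this density estimate is for small datasets.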

Similarly we can compute the vector of $\delta_j$ using $\rho_j$.

In [3]:
def higher_density_distance(local_density, metric, data):
    """Minimum distance to a point with higher density, from RL14."""
    n = local_density.shape[0]
    delta = np.zeros(n)
    for j in range(n):
        # np.where returns a tuple of index arrays; take the first element
        greater = np.where(local_density > local_density[j])[0]
        x = data[j]
        if len(greater) == 0:
            # Highest-density point: use the maximum distance by convention
            dist = metric(x, data)
            delta[j] = np.max(dist)
        else:
            dist = metric(x, data[greater])
            delta[j] = np.min(dist)
    return delta
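A brute-force check on the same kind of five-point toy data (again a self-contained sketch with illustrative values) shows how the integer-valued $\rho_j$ produces ties, and hence duplicate large-$\delta$ points:

```python
import numpy as np

data = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
d = np.abs(data[:, None] - data[None, :])
rho = (d < 1.5).sum(axis=1)                 # [2, 3, 2, 2, 2]

delta = np.empty_like(data)
for j in range(len(data)):
    higher = rho > rho[j]
    if higher.any():
        delta[j] = d[j, higher].min()       # nearest point of higher density
    else:
        delta[j] = d[j].max()               # densest point: max-distance convention
print(delta)  # [ 1. 10.  1.  9. 10.]
```

Points 3 and 4 tie at $\rho = 2$, so neither sees the other as "higher density" and both receive a large $\delta$ even though they form a single cluster: exactly the duplicate-center symptom described above.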

To generate plots of $\rho_j$ vs. $\delta_j$, we define the following functions.

In [177]: (plotting helpers; code hidden)

Two Clusters and Noise

This first example mimics the first example in their paper: two well-separated clusters plus noise. They do not mention which distributions they draw their clusters and noise from, so here we simply generate two tight normally distributed clusters plus one with a large variance, corresponding to noise.
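The hidden generation cell presumably looks something like the following sketch; the cluster locations, spreads, sizes, and seed here are guesses, not the notebook's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tight clusters plus one broad "noise" component (all parameters illustrative)
cluster1 = rng.normal(loc=[0.25, 0.25], scale=0.03, size=(100, 2))
cluster2 = rng.normal(loc=[0.75, 0.75], scale=0.03, size=(100, 2))
noise    = rng.normal(loc=[0.50, 0.50], scale=0.30, size=(50, 2))
data = np.vstack([cluster1, cluster2, noise])
labels = np.repeat([0, 1, 2], [100, 100, 50])
```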

In [199]: (data generation; code hidden)
In [149]: (cutoff sweep; code hidden)
Out[149]: (interactive $\rho_j$ vs. $\delta_j$ plot with a slider for the cutoff $d_c$)

In the above example, we have a hard time replicating the excellent performance of their Figure 1B. In their figure, the local density appears to take on real values, rather than being restricted to the integers that their definition of $\rho_j$ allows; this may be a visualization artifact to let the many low-lying points be distinguished. However, no value of $d_c$ seems to make exactly one representative of each class appear. Instead, duplicates often appear from the same class. In the construction of their first example and in Figure 3(A,C,D), points have very similar minimum distances, which is not typical of the distributions we used above. This may help keep $\delta_j$ small on all but one point per cluster.

Alternative Distance Metric

The choice of the Heaviside function in the definition of $\rho_j$ was surely made for simplicity. Of course, we could try another definition. For example, we could use the probability density of a $\chi^2$ distribution with $p$ degrees of freedom, defining $$ \rho_j = \sum_{k \neq j} f_{\chi^2_p}\!\left( \frac{d_{j,k}^2}{\sigma^2} \right), $$ where $f_{\chi^2_p}$ denotes the $\chi^2_p$ density.

In [150]:
import numpy as np
from scipy.stats import chi2

def chi_density(metric, data, sigma, df):
    """Chi-squared density kernel, an alternative to the RL14 Heaviside.

    This is not an efficient implementation!
    """
    n = data.shape[0]
    rho = np.zeros(n)
    kernel = chi2(df)
    for j in range(n):
        x = data[j]
        dist = metric(x, data).reshape(n)
        # Skip the self-distance dist[j] = 0, where the chi^2 pdf
        # diverges for df = 1
        rho[j] = np.sum(kernel.pdf(dist[0:j]**2 / sigma**2))
        rho[j] += np.sum(kernel.pdf(dist[j+1:]**2 / sigma**2))
    return rho
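To see why the smooth kernel helps, consider the five-point toy example again with $df = 2$, whose $\chi^2$ pdf is simply $e^{-x/2}/2$; the value $\sigma = 2$ is an illustrative choice.

```python
import numpy as np

data = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
d = np.abs(data[:, None] - data[None, :])
sigma = 2.0

def chi2_pdf_df2(x):
    # chi^2 pdf with 2 degrees of freedom: f(x) = exp(-x/2) / 2
    return np.exp(-x / 2) / 2

# Sum the kernel over all other points, skipping the self-distance
rho = np.array([chi2_pdf_df2(np.delete(d[j], j)**2 / sigma**2).sum()
                for j in range(len(data))])
```

The densities are now real-valued: points 3 and 4 no longer tie (point 3 is slightly denser because it sits a bit closer to the left cluster), so only one of them retains a large $\delta$.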
In [180]: (plotting code hidden)
In [163]:
StaticInteract(lambda **kw: generate_rho_delta_chi(euclidean_distances, data, kw['sigma'], kw['df'], labels = labels),
               sigma=RangeWidget(1,5,1), df = RadioWidget(range(1,5)))
Out[163]: (interactive plot with controls for $\sigma$ and the degrees of freedom $df$)

This $\chi^2$ density kernel seems much more robust at separating the two cluster centers. The parameter $\sigma$ seems to have minimal impact.

Gaussian Clusters Example

Rodriguez and Laio's second example, in their Figure 2, shows a set of points drawn from a mixture of several populations, each with a distinct high-density center and density decreasing away from the center. Here we replicate that with two normal distributions.

In [198]: (data generation; code hidden)
In [156]:
StaticInteract(lambda cutoff: sweep_cutoff(euclidean_distances, data, cutoff, labels = labels), cutoff=RangeWidget(0,0.1,0.005))
Out[156]: (interactive plot with a slider for the cutoff $d_c$)

Again, we see that the Heaviside-based density has difficulty separating the two clusters uniquely; oftentimes we get duplicate cluster centers. However, if we use the $\chi^2$-based density, duplicates do not appear.

In [162]:
StaticInteract(lambda **kw: generate_rho_delta_chi(euclidean_distances, data, kw['sigma'], kw['df'], labels = labels),
               sigma=RangeWidget(1,20,1), df = RadioWidget(range(1,6)))
Out[162]: (interactive plot with controls for $\sigma$ and the degrees of freedom $df$)

NIST Handwriting Example

To test this approach, we next consider the standard NIST handwritten digits dataset that is included with scikit-learn.

In [164]:
from sklearn import datasets
digits = datasets.load_digits()
n = digits['data'].shape[0]

Here is an example of what these images look like (taken from Joshua Bloom's notes).

In [165]:
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 6))
for i in range(10):
    plt.subplot(1, 10, i + 1)  # subplot indices start at 1
    plt.imshow(digits['images'][i + 10], cmap='gray', interpolation='nearest')
    plt.grid(visible=False)

To pick a range of distance cutoffs, we look at a typical entry's distances to the other points.

In [134]:
dist = euclidean_distances(digits['data'][0:1], digits['data']).reshape(n)
plt.hist(dist, 50)
plt.title('Histogram of Euclidean Distances');

The following plot shows how different values of cutoff $d_c$ affect the $\rho_j$ vs. $\delta_j$ plot.

In [143]:
StaticInteract(lambda cutoff: sweep_cutoff(euclidean_distances, digits['data'], cutoff, labels = digits['target']), 
               cutoff=RangeWidget(10,70,2))
Out[143]: (interactive plot with a slider for the cutoff $d_c$)

Several values of the cutoff $d_c$ provide reasonable separation for each digit. However, we do not see as clear a separation as in the preceding examples; nor should we expect such separation on real data. The $\chi^2$ density kernel gives similar performance below.
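One way to turn these decision plots into concrete centers (a sketch, not something the notebook does explicitly) is to rank points by the product $\gamma_j = \rho_j \delta_j$ and take the top few. On synthetic blobs, with a tiny jitter added to break the integer ties in $\rho_j$, the top-$\gamma$ points land one per blob; all parameters and the seed below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
# Three well-separated 2-D blobs of 60 points each
centers = np.array([[0.0, 0.0], [5.0, 5.0], [5.0, -5.0]])
data = np.vstack([rng.normal(c, 0.4, size=(60, 2)) for c in centers])
true = np.repeat([0, 1, 2], 60)

d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
rho = (d < 1.0).sum(axis=1).astype(float)
rho += rng.uniform(0, 1e-6, size=len(rho))  # jitter to break integer ties
delta = np.empty(len(data))
for j in range(len(data)):
    higher = rho > rho[j]
    delta[j] = d[j, higher].min() if higher.any() else d[j].max()

gamma = rho * delta
top3 = np.argsort(gamma)[-3:]               # candidate cluster centers
```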

In [202]:
from math import sqrt
StaticInteract(lambda **kw: 
               generate_rho_delta_chi(euclidean_distances, digits['data'], 60/sqrt(kw['df']), kw['df'], labels = digits['target']),
               df = RangeWidget(1,64,1))
Out[202]: (interactive plot with a slider for the degrees of freedom $df$)

Conclusions

There is a deep rabbit hole of different possible density kernels. The Heaviside density kernel is appealing due to its simplicity, but it performs poorly for small numbers of points, since its integer counts cannot offer much resolution. The $\chi^2$ density kernel is a logical choice when we assume that each point is perturbed by iid Gaussian noise. Which density kernel to choose, like most parameters in clustering algorithms, will depend on the characteristics of the dataset.